Here, we take a raw RNA-seq counts table through quality assessment, filtering, and normalization. For more information on how counts were generated, see our SEAsnake pipeline. These data are human, bulk, paired-end RNA-seq, but this pipeline can be applied to other organisms or single-read libraries.
This pipeline should be completed in R and RStudio. You should also install the following packages.
#CRAN packages
install.packages("tidyverse")
#Bioconductor packages
install.packages("BiocManager")
BiocManager::install(c("edgeR", "limma", "biomaRt"))
And load them into your current R session.
#Data manipulation and plotting
library(tidyverse)
#RNAseq data
library(edgeR)
library(limma)
#Note we do not load biomaRt because it has conflicts with the tidyverse.
# We instead call its functions with biomaRt::
# Set random seed for reproduciblity
set.seed(651)
To directly use the code in this pipeline, you must organize your files as follows.
Using the tidyverse read_ functions you saw earlier today, read in the counts and sample metadata files.
count <- read_csv("0_data/raw_counts.csv")
## Rows: 44786 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): hgnc_symbol
## dbl (20): pt01_Media, pt01_Mtb, pt02_Media, pt02_Mtb, pt03_Media, pt03_Mtb, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
meta <- read_csv("0_data/metadata.csv")
## Rows: 20 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): libID, ptID, condition, sex, ptID_old
## dbl (2): age_dys, total_seq
## lgl (2): RNAseq, methylation
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We assess library quality using metrics from samtools flagstat and picard. Here, we will only look at total sequences for brevity.
Using ggplot, let’s plot the total sequences per library.
ggplot(meta, aes(x = libID, y = total_seq)) +
geom_col()
The above plot is somewhat difficult to read so we modify it with some ggplot features you’ve seen before and some that are new.
ggplot(meta,
#Reorder x axis by y-axis value
aes(x = reorder(libID, total_seq),
y = total_seq)) +
geom_col() +
#Add cutoff line
geom_hline(yintercept = 1E6, color = "red") +
#Set theme
theme_classic() +
#Relabel axes
labs(x = "Library", y = "Total sequences") +
#Rotate x-axis labels
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Overall we see that all libraries have more the 1 million sequences. Based on this and other quality metrics not shown here, we determine that all libraries pass quality check.
Standard differential expression analyses focus on protein-coding genes as these RNA products are the most likely to result in a measurable phenotype. We annotate the counts table genes to their biotypes as well as additional identifiers with biomaRt.
First, load the gene data base from biomaRt. Note that here we use biomaRt::function( ) because we did not load this package with library( ) earlier.
#Get database
ensembl <- biomaRt::useEnsembl(biomart="ensembl",
dataset="hsapiens_gene_ensembl",
mirror = "uswest")
#Get gene key
key <- biomaRt::getBM(attributes=c("hgnc_symbol", "gene_biotype"),
mart=ensembl)
Also, you could extract more than just the biotypes from biomaRt. See all the possible data with listAttributes.
attrib <- biomaRt::listAttributes(ensembl)
head(attrib)
## name description page
## 1 ensembl_gene_id Gene stable ID feature_page
## 2 ensembl_gene_id_version Gene stable ID version feature_page
## 3 ensembl_transcript_id Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5 ensembl_peptide_id Protein stable ID feature_page
## 6 ensembl_peptide_id_version Protein stable ID version feature_page
Next, filter the key to protein-coding genes present in the RNA-seq count data.
key.filter <- key %>%
#Filter genes in count table
filter(hgnc_symbol %in% count$hgnc_symbol) %>%
#Filter protein-coding
filter(gene_biotype == "protein_coding")
And filter the count table to genes in the protein-coding key.
count.pc <- count %>%
filter(hgnc_symbol %in% key.filter$hgnc_symbol)
This removes 26518 genes from this data set and leaves 18268 genes for analysis.
Sometimes one or more libraries will appear as outliers in PCA. We define this as any library more than 3 standard deviations away from the mean PC1 and/or PC2.
First, we must reformat the counts data to a log counts per million (cpm) matrix in order for PCA to work. We also transpose it so that genes are columns and libraries are rows. Without transposing, we’d get 1 dot per gene instead of 1 per library.
count.for.pca <- count.pc %>%
#move HGNC gene symbol to rownames
column_to_rownames("hgnc_symbol") %>%
#convert to log cpm
cpm(log=TRUE) %>%
#transpose
t()
count.for.pca[1:3,1:3]
## A1BG A1CF A2M
## pt01_Media 1.835063 -1.865271 7.389832
## pt01_Mtb 1.535061 -1.865271 6.397031
## pt02_Media 1.369070 -1.865271 6.106875
Then, we calculate PCA on these transformed data.
PCA <- prcomp(count.for.pca, scale.=TRUE, center=TRUE)
A basic ggplot of these PCA data looks like so, but it does not tell us anything about outliers.
#Determine percent explained for each PC
summary(PCA)$importance
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 67.41996 46.09409 38.32512 36.86756 34.23585 31.11334
## Proportion of Variance 0.24882 0.11631 0.08040 0.07440 0.06416 0.05299
## Cumulative Proportion 0.24882 0.36513 0.44553 0.51993 0.58409 0.63709
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 28.79221 28.08592 26.04112 24.22482 22.73842 22.68737
## Proportion of Variance 0.04538 0.04318 0.03712 0.03212 0.02830 0.02818
## Cumulative Proportion 0.68247 0.72565 0.76277 0.79489 0.82319 0.85137
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 21.77168 20.72515 20.25228 19.57118 19.30086 18.76942
## Proportion of Variance 0.02595 0.02351 0.02245 0.02097 0.02039 0.01928
## Cumulative Proportion 0.87732 0.90083 0.92328 0.94425 0.96464 0.98393
## PC19 PC20
## Standard deviation 17.12298 0.664296
## Proportion of Variance 0.01605 0.000020
## Cumulative Proportion 0.99998 1.000000
#create labels with percent explained for the PC
PC1.label <- paste("PC1", " (", summary(PCA)$importance[2,1]*100, "%)", sep="")
PC2.label <- paste("PC2", " (", summary(PCA)$importance[2,2]*100, "%)", sep="")
ggplot(as.data.frame(PCA$x),
aes(x=PC1, y=PC2)) +
geom_point(size=3) +
labs(x=PC1.label, y=PC2.label)
We need to calculate the means and standard deviations of our PCs and create a new variable to color the points in PCA.
#calculate mean and sd for PC1 and PC2
mean1 <- mean(PCA$x[,1])
sd1 <- sd(PCA$x[,1])
mean2 <- mean(PCA$x[,2])
sd2 <- sd(PCA$x[,2])
#Challenge! Calculate mean and sd for all PCs with tidyverse summarize()
#create variable for outliers
PCA2 <- as.data.frame(PCA$x) %>%
mutate(outlier = ifelse(PC1 > mean1 + 3*sd1 | PC1 < mean1 - 3*sd1 |
PC2 > mean2 + 3*sd2 | PC2 < mean2 - 3*sd2,
"outlier","okay"))
Then our PCA plot reveals that there are no outliers. Note we added some additional ggplot customization to beautify this plot.
ggplot(PCA2, aes(x=PC1, y=PC2)) +
geom_point(aes(color=outlier), size=3) +
labs(x=PC1.label, y=PC2.label) +
#Set theme
theme_classic() +
#set xy ratio
coord_fixed(ratio=1) +
#change colors
scale_color_manual(values = c("okay"="grey70",
"outlier"="red"))
We recommend that you initially remove dramatic outliers but leave those that are borderline or questionable. Then, you can re-assess outliers after gene filtering and normalization. You may find that some are no longer outliers after these steps. If they are, you can return to this step and remove them before repeating subsequent steps.
This workshop covers the minimum quality control and filtering needed for RNA-seq libraries. Checkout our full tutorial for additional QC metrics, batch effects, custom plotting functions, and more!
At this stage, we’ve completed sample filtering and can collapse our count and meta data into a single list object. This allows us to shorten our long object names as well as works efficiently with the remaining cleaning steps.
We merge into a DGEList object, which is edgeR format. We must convert the counts data frame to a matrix for this process.
count.pc.mat <- count.pc %>%
column_to_rownames("hgnc_symbol") %>%
as.matrix()
dat <- DGEList(counts=count.pc.mat, samples=meta, genes=key.filter)
Low abundance (small counts) and rare genes (many 0 counts) are removed from the data because they:
Our goal is to remove genes in the lower left of the mean-variance plot because counts (x) and variance (y) are low e.g. these genes break the red mean variance trend line.
Here we co-opt limma’s voom function to make out mean-variance plot for us.
mv1 <- voom(dat, plot=TRUE)
We use our custom function to retain only genes that are at least min.CPM counts per million in at least min.sample number of samples OR in at least min.pct percent of samples. Here, we use 0.5 CPM in at least 3 samples.
dat.abund <- RNAetc::filter_rare(dat, min.CPM = 0.5, min.sample = 3,
gene.var="hgnc_symbol")
## 4849 (26.54%) of 18268 genes removed.
Plotting the filtered data, we see the red trend line no longer curves down on the left. There is still a bit of the downward tail of dots but this filtering is sufficient.
mv2 <- voom(dat.abund, plot=TRUE)
This filtering generally results in the removal of around 25% of genes. There is no exact cutoff for this filtering, and you should try several cutoffs to observe the effects. In general, we use minimum CPM from 0.1 - 1, minimum samples around 3 for small data sets, or minimum samples from 5 - 10% in larger data sets. It is important to keep the minimum sample cutoff larger than your smallest group of interest, otherwise you may filter genes specifically associated with one group. For example, if you have 10 libraries with 5 each of media vs stimulated, you’re min.sample value should not be > 5 or else you will remove genes only expressed in one of media or stimulated groups.
You may also wish to look for specific genes of interest and ensure they are not being filtered. The following calculates some statistics on filtered genes and saves a csv for your perusal.
rare <- as.data.frame(cpm(dat$counts)) %>%
#Filter genes removed
rownames_to_column("hgnc_symbol") %>%
filter(!(hgnc_symbol %in% rownames(dat.abund$counts))) %>%
#Add gene symbols
left_join(dat$genes, by = "hgnc_symbol") %>%
#Calculate mean, min, max and total libraries
pivot_longer(-c(hgnc_symbol, gene_biotype)) %>%
group_by(hgnc_symbol, gene_biotype) %>%
summarise(mean.CPM = mean(value, na.rm=TRUE),
min.CPM = min(value, na.rm=TRUE),
max.CPM = max(value, na.rm=TRUE),
express.in.libs = length(value[value > 0]),
.groups="drop")
write_csv(rare, file="3_RNAseq_cleaning/P259_rare_genes.csv")
rare
## # A tibble: 4,849 × 6
## hgnc_symbol gene_biotype mean.CPM min.CPM max.CPM express.in.libs
## <chr> <chr> <dbl> <dbl> <dbl> <int>
## 1 A1CF protein_coding 0.0404 0 0.322 4
## 2 A3GALT2 protein_coding 0 0 0 0
## 3 A4GNT protein_coding 0.0450 0 0.260 5
## 4 AADAC protein_coding 0 0 0 0
## 5 AADACL2 protein_coding 0.0190 0 0.203 3
## 6 AADACL3 protein_coding 0.0481 0 0.877 2
## 7 AADACL4 protein_coding 0.00427 0 0.0855 1
## 8 AADAT protein_coding 0.0386 0 0.322 5
## 9 AARD protein_coding 0.0827 0 0.520 7
## 10 ABCA12 protein_coding 0.0929 0 0.738 7
## # … with 4,839 more rows
RNA-seq counts are not independent within a library and not comparable across libraries. A library with 1 million sequences will have higher counts for most genes than one with 1 thousand sequences. We correct for this aspect of the data with the following normalization steps.
TMM defines a reference sample from your data set as the one with the most representative expression for the overall data set. Specifically, the reference sample is the one whose upper quartile is closest to the overall data set upper quartile. The upper quantile is the value (x) where 75% of genes have expression < x. Thus, the reference sample is the sample whose x is the closest to mean(x) across all samples.
All other samples are considered test samples. For each test sample, a scaling factor is calculated based on the weighted mean of log ratios of representative genes between the test and reference. These representative genes are a subset of the data set, removing the highest and lowest expressed genes as well as genes with the highest and lowest log ratios. The exact genes used as representative genes for scaling are dynamic and specific to each test sample.
The calculated scaling factors are then applied to the counts table automatically in the voom step.
dat.abund.norm <- calcNormFactors(dat.abund, method = "TMM")
We continue normalization by converting counts to CPM within each sample, thus accounting for differential sampling depth. We also perform log2 transformation, because RNA-seq data are heavily right-skewed and thus, violate assumptions of normality.
as.data.frame(dat.abund$counts) %>%
rownames_to_column() %>%
pivot_longer(-rowname) %>%
ggplot() +
geom_histogram(aes(x=value), bins = 100) +
theme_classic() +
labs(x = "count", y = "occurences") +
lims(x=c(0,1000))
voom performs both of these steps! We use voomWithQualityWeights to additionally calculate sample specific quality weights that can be of use as co-variates in linear modeling.
#define model matrix
mm <- model.matrix(~ condition, data=dat.abund.norm$samples)
dat.abund.norm.voom <- voomWithQualityWeights(
dat.abund.norm,
design=mm,
plot=TRUE)
To get an initial sense of the cleaned data, we plot some variables of interest. Here, we see a clear Mtb infection signal and no relationship to sex.
#transpose normalized expression data
voom.for.pca <- t(dat.abund.norm.voom$E)
#Calculate PCA
PCA3 <- prcomp(voom.for.pca, scale.=TRUE, center=TRUE)
#create labels with percent explained for the PC
PC1.label <- paste("PC1", " (", summary(PCA)$importance[2,1]*100, "%)", sep="")
PC2.label <- paste("PC2", " (", summary(PCA)$importance[2,2]*100, "%)", sep="")
#Merge PCA results with metadata
PCA.meta <- as.data.frame(PCA3$x) %>%
rownames_to_column("libID") %>%
full_join(meta)
## Joining, by = "libID"
#color by condition
ggplot(PCA.meta, aes(x=PC1, y=PC2)) +
geom_point(aes(color=condition), size=3) +
labs(x=PC1.label, y=PC2.label) +
#Set theme
theme_classic() +
#set xy ratio
coord_fixed(ratio=1)
#color by sex
ggplot(PCA.meta, aes(x=PC1, y=PC2)) +
geom_point(aes(color=sex), size=3) +
labs(x=PC1.label, y=PC2.label) +
#Set theme
theme_classic() +
#set xy ratio
coord_fixed(ratio=1)
Write final voom object as RData
save(dat.abund.norm.voom, file = "3_RNAseq_cleaning/dat_voom.RData")
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DiagrammeR_1.0.8 edgeR_3.36.0 limma_3.50.3 forcats_0.5.1
## [5] stringr_1.4.0 dplyr_1.0.8 purrr_0.3.4 readr_2.1.2
## [9] tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 fs_1.5.2 lubridate_1.8.0
## [4] bit64_4.0.5 filelock_1.0.2 progress_1.2.2
## [7] RColorBrewer_1.1-3 httr_1.4.2 GenomeInfoDb_1.30.1
## [10] rprojroot_2.0.3 tools_4.1.2 backports_1.4.1
## [13] bslib_0.3.1 utf8_1.2.2 R6_2.5.1
## [16] DBI_1.1.2 BiocGenerics_0.40.0 colorspace_2.0-3
## [19] withr_2.5.0 prettyunits_1.1.1 tidyselect_1.1.2
## [22] curl_4.3.2 bit_4.0.4 compiler_4.1.2
## [25] cli_3.3.0 rvest_1.0.2 Biobase_2.54.0
## [28] xml2_1.3.3 labeling_0.4.2 sass_0.4.0
## [31] scales_1.2.0 rappdirs_0.3.3 digest_0.6.29
## [34] rmarkdown_2.11 XVector_0.34.0 pkgconfig_2.0.3
## [37] htmltools_0.5.2 dbplyr_2.1.1 fastmap_1.1.0
## [40] highr_0.9 htmlwidgets_1.5.4 rlang_1.0.2
## [43] readxl_1.3.1 rstudioapi_0.13 RSQLite_2.2.12
## [46] visNetwork_2.1.0 jquerylib_0.1.4 farver_2.1.0
## [49] generics_0.1.2 jsonlite_1.8.0 vroom_1.5.7
## [52] RCurl_1.98-1.6 magrittr_2.0.3 GenomeInfoDbData_1.2.7
## [55] Rcpp_1.0.8.3 munsell_0.5.0 S4Vectors_0.32.4
## [58] fansi_1.0.3 lifecycle_1.0.1 stringi_1.7.6
## [61] yaml_2.3.5 zlibbioc_1.40.0 BiocFileCache_2.2.1
## [64] RNAetc_0.1.0 grid_4.1.2 blob_1.2.3
## [67] parallel_4.1.2 crayon_1.5.1 lattice_0.20-45
## [70] Biostrings_2.62.0 haven_2.4.3 KEGGREST_1.34.0
## [73] hms_1.1.1 locfit_1.5-9.5 knitr_1.39
## [76] pillar_1.7.0 biomaRt_2.50.3 stats4_4.1.2
## [79] reprex_2.0.1 XML_3.99-0.8 glue_1.6.2
## [82] evaluate_0.15 modelr_0.1.8 png_0.1-7
## [85] vctrs_0.4.1 tzdb_0.3.0 cellranger_1.1.0
## [88] gtable_0.3.0 assertthat_0.2.1 cachem_1.0.6
## [91] xfun_0.30 broom_0.8.0 AnnotationDbi_1.56.2
## [94] memoise_2.0.1 IRanges_2.28.0 ellipsis_0.3.2